GitHub Repository Link

Background and Motivation

Reddit and Stack Overflow are two community-based websites used heavily by people all over the world. Reddit is known as the “Front Page of the Internet” and is a popular forum, especially among young people, where users can post about almost anything. It has a large international community and a lot of programming-related content. Reddit hosts discussions on every conceivable topic, whereas Stack Overflow is centered on programming languages and the ideas and discussions surrounding them. Stack Overflow is primarily a question-and-answer system, while Reddit is mostly a post-and-comment system. Both websites have an upvote/downvote system and are valuable resources for programmers around the globe.

We want to use data from Reddit to better understand the popularity of programming languages among Reddit users. Additionally, we want to compare it with data from Stack Overflow. We want to evaluate which programming languages are being discussed on both platforms and compare how their usage or popularity has changed over time. To this end, we want to use visualization and machine learning methods based on the text data, and also use the quantitative signals we get from the upvotes and the number of comments on both platforms.

Since both platforms are very popular and contain a lot of programming-related discussion, it is worthwhile to study how users have been discussing some of the most widely used programming languages. The change in these trends would help us understand how the popularity of programming languages has changed over time, and we could then predict which programming languages will be the center of discussion or the most queried on these two platforms in the near future.

Project Objective

The objective of our project is to find correlations between the different parameters of questions or posts on these platforms and to measure how the popularity of programming languages has changed over time. We would then like to predict the trends for these programming languages in the near future. We could also use this information to identify the kinds of problems or topics most closely associated with each programming language and to analyze which topics or solutions are most used for certain kinds of problems. We will try to answer the following research questions:

Research Questions

  • Can we decide which topic/language a certain post is about?
  • How do the number of upvotes, the number of comments and the number of posts correlate with popularity?
  • How does the popularity of programming languages change over time?
  • Can we predict the popularity of programming languages in the future?
  • How do the two platforms compare with respect to programming languages?

Datasets

For the Reddit posts, the plan is to use a Reddit API to get data sets for a certain time range and a number of specific subreddits. The choice of subreddits is crucial for the quality and expressiveness of our data and will be based on prior research into interesting programming-related subreddits. From this data we can then get the subreddit, title, text, upvotes and various metadata.

For Stack Overflow, we plan to use the data dumps available on the Internet Archive, merge them, and preprocess them further for the analysis, visualisation and prediction tasks.
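The mapping from a dump file to a Reddit-comparable record could look roughly like the sketch below. It assumes that the dump's Posts.xml stores one `<row>` element per post with attributes such as `Id`, `PostTypeId`, `Title`, `Tags`, `Score`, `CommentCount` and `CreationDate`, and that tags are written as `<python><dictionary>`. The sample XML, the function names and the output field names are illustrative and not taken from the project.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Tiny hand-written stand-in for Posts.xml (assumed layout, see lead-in).
SAMPLE_POSTS_XML = """<posts>
  <row Id="1" PostTypeId="1" CreationDate="2021-05-20T19:31:20.000"
       Score="13" CommentCount="1" Title="How do I merge two dicts?"
       Tags="&lt;python&gt;&lt;dictionary&gt;" />
  <row Id="2" PostTypeId="2" ParentId="1" CreationDate="2021-05-20T20:00:00.000"
       Score="5" CommentCount="0" />
</posts>"""


def parse_tags(raw):
    """Split a tag string such as '<python><dictionary>' into a list."""
    return re.findall(r"<([^<>]+)>", raw or "")


def questions_from_posts_xml(xml_text):
    """Return questions (PostTypeId == 1) in a Reddit-like record layout."""
    records = []
    for row in ET.fromstring(xml_text).iter("row"):
        if row.get("PostTypeId") != "1":
            continue  # skip answers and other post types
        # Timestamps are treated as UTC (an assumption of this sketch).
        created = datetime.fromisoformat(row.get("CreationDate"))
        records.append({
            "id": row.get("Id"),
            "title": row.get("Title", ""),
            "tags": parse_tags(row.get("Tags")),
            "score": int(row.get("Score", 0)),
            "num_comments": int(row.get("CommentCount", 0)),
            "created_utc": int(created.replace(tzinfo=timezone.utc).timestamp()),
        })
    return records
```

Calling `questions_from_posts_xml(SAMPLE_POSTS_XML)` keeps only the question row. For the full dumps, which are far too large to load at once, a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be the better fit.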

Data set example (Reddit Dataset)

| data.subreddit | data.title | data.id | data.created | data.created_utc | data.upvote_ratio | data.ups | data.score | data.num_comments |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| coding | I built this Lottie animation editor to edit Lottie animations without After Effects! If like me, you use Lottie animations as part of your frontend UI but struggle with After Effects or implementation issues, would like to know what you think | nh9900 | 1621567880 | 1621539080 | 0.88 | 13 | 13 | 1 |
| coding | Back-End VS Front-End Framework \| 6 J.S. Frameworks Experts Love - Untied Blogs | nh0yzf | 1621547972 | 1621519172 | 0.33 | 0 | 0 | 1 |
| coding | File Descriptor Limits | ngzeep | 1621543958 | 1621515158 | 0.40 | 0 | 0 | 0 |
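The `data.` prefix on the column names suggests that the rows were flattened from a nested JSON listing in which each post sits under a `data` key. A minimal sketch of that flattening, assuming such a layout, is shown here. The sample payload is hand-written from the last example row, and the helper name, the field list and the added `month` column are hypothetical.

```python
import json
from datetime import datetime, timezone

# Hand-written sample shaped like a nested listing (assumed layout).
SAMPLE_LISTING = json.loads("""
{"data": {"children": [
  {"kind": "t3", "data": {"subreddit": "coding", "title": "File Descriptor Limits",
   "id": "ngzeep", "created_utc": 1621515158, "upvote_ratio": 0.40,
   "ups": 0, "score": 0, "num_comments": 0}}
]}}
""")

FIELDS = ("subreddit", "title", "id", "created_utc",
          "upvote_ratio", "ups", "score", "num_comments")


def flatten_listing(listing, fields=FIELDS):
    """Turn each nested post into one flat row with 'data.'-prefixed keys."""
    rows = []
    for child in listing["data"]["children"]:
        post = child["data"]
        row = {f"data.{name}": post.get(name) for name in fields}
        # Derive a month bucket from the UTC timestamp for later trend analysis.
        created = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)
        row["month"] = created.strftime("%Y-%m")
        rows.append(row)
    return rows
```

The extra `month` column is not part of the example data set; it is added here only because the later popularity-over-time questions need some time bucket.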

Design overview

Data Preprocessing

  • Stack Overflow data dumps: https://archive.org/download/stackexchange
    • Using the different Stack Overflow dumps (posts, tags, users, votes, post history, comments etc.)
    • Merging these separate data dumps to get the relevant data and make it comparable with the Reddit data before using it for our analysis and visualisation
  • Reddit data:
    • Fetching the Reddit data from the Pushshift API
    • Finding relevant subreddits
    • Selecting the features that should be used
    • Text preprocessing:
      • Removing punctuation and stopwords
      • Stemming
      • Vectorization
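The text preprocessing steps listed above can be sketched with the standard library alone. This is a minimal illustration and not the project's pipeline: the stopword list is a tiny made-up sample, the stemmer is a crude suffix stripper standing in for a real algorithm such as Porter's, and the vectorization is a plain bag-of-words count.

```python
import string
from collections import Counter

# Tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "i", "my", "with", "for"}

# Crude suffix stripping, standing in for a real stemmer such as Porter's.
SUFFIXES = ("ing", "ed", "es", "s")

# Map every punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def tokenize(text):
    """Lowercase, strip punctuation, and drop stopwords."""
    return [tok for tok in text.lower().translate(PUNCT_TABLE).split()
            if tok not in STOPWORDS]


def stem(token):
    """Remove the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def vectorize(documents):
    """Return (vocabulary, count vectors) for a list of raw texts."""
    bags = [Counter(stem(tok) for tok in tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*bags))
    vectors = [[bag[term] for term in vocabulary] for bag in bags]
    return vocabulary, vectors
```

One caveat for this particular project: blanket punctuation removal turns language names such as "C++" and "C#" into "c", so such names would probably need to be protected or normalized before this step.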

Exploratory Data Analysis and Visualization

  • Exploratory data analysis using box plots, histograms, scatter plots etc. on the upvotes, the number of comments and the number of posts for every programming language
  • Showing the trend with a line plot and a confidence interval
  • Using a 2D embedding of the posts and a scatter plot to show similarity
  • Comparing Stack Overflow and Reddit in the plots
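The numbers behind a trend line with a confidence band could be computed as in the following sketch. It assumes posts arrive as `(language, created_utc, score)` tuples and uses a normal-approximation 95% interval around the monthly mean score; the function name, the input layout and the choice of interval are assumptions made for illustration.

```python
import math
import statistics
from collections import defaultdict
from datetime import datetime, timezone


def monthly_trend(posts, z=1.96):
    """Group scores by (language, month) and return (mean, lower, upper).

    `posts` is an iterable of (language, created_utc, score) tuples.
    """
    buckets = defaultdict(list)
    for language, created_utc, score in posts:
        month = datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime("%Y-%m")
        buckets[(language, month)].append(score)

    trend = {}
    for key, scores in sorted(buckets.items()):
        mean = statistics.fmean(scores)
        if len(scores) > 1:
            # Half width of the interval: z times the standard error of the mean.
            half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
        else:
            half_width = 0.0  # a single observation gives no spread estimate
        trend[key] = (mean, mean - half_width, mean + half_width)
    return trend
```

The resulting lower and upper values can be passed to any plotting library as the shaded band around the mean line. With few posts per month, a t-based or bootstrap interval would be more appropriate than the normal approximation.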

Algorithms

  • Clustering on posts : K-Means Algorithm
  • Topic Modeling on individual clusters : LDA - Latent Dirichlet Allocation
  • Classifying relevant/irrelevant Posts : Decision trees or ensemble methods
  • Using sentiment analysis and clustering to evaluate if a post is a professional question or an opinion
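To make the clustering step concrete, here is a minimal sketch of K-Means (Lloyd's algorithm) on vectorized posts, written without external libraries. It is only an illustration: in practice an existing implementation such as scikit-learn's `KMeans` would likely be used, and the deterministic seeding below is a simplification chosen to keep the example reproducible.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on a list of equal-length numeric vectors.

    Centroids start at the first k points, which keeps the sketch
    deterministic; real use would prefer randomized or k-means++ seeding.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids
```

For example, `kmeans([[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]], k=2)` separates the two points near the origin from the two points near (10, 10). Topic modeling with LDA would then be run on the posts of each resulting cluster.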

Prediction of future popularity

  • Auto Regressive Integrated Moving Average (ARIMA)
  • Exponential Smoothing
  • Random Forest
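Of the three candidates, exponential smoothing is the simplest to write down. The sketch below implements simple (single) exponential smoothing, where each level is s_t = alpha * x_t + (1 - alpha) * s_(t-1). The series is assumed to be something like monthly post counts for one language, and the function names and default `alpha` are illustrative choices.

```python
def simple_exponential_smoothing(series, alpha=0.5):
    """Return the smoothed levels s_t = alpha * x_t + (1 - alpha) * s_(t-1)."""
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    levels = [float(series[0])]  # initialize the level with the first observation
    for value in series[1:]:
        levels.append(alpha * value + (1 - alpha) * levels[-1])
    return levels


def forecast(series, horizon, alpha=0.5):
    """Simple smoothing has no trend term, so every future step equals the last level."""
    last_level = simple_exponential_smoothing(series, alpha)[-1]
    return [last_level] * horizon
```

Because the forecast is flat, this variant only serves as a baseline. A series with a clear upward or downward trend would call for a trend-aware variant such as Holt's method, or for ARIMA.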

Shiny application

  • Using Shiny as a tool for interactive exploration of the data set
    • Sliders for selecting a time range of interest
    • Checkboxes or Dropdown menus to select different programming languages
    • Dropdown to visualize wordcloud on different clusters
    • Further enhancements based on data visualisation
  • Bonus: Shiny offers the opportunity to refresh the data sets on a web page, making it possible to get new insights every day

Time plan